Parsing Early Modern English for Linguistic Search
This work addresses the question of whether the output of a state-of-the-art parser is accurate enough to support research in theoretical linguistics. In order to build reliable models of syntactic change, we aim to eventually parse the 1.5-billion-word Early English Books Online (EEBO) corpus. But since EEBO is not yet parsed, we begin by constructing and testing a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). In order to obtain robust results, we define an 8-fold split on PPCEME. We then evaluate the parser with evalb and, more relevantly for us, with a task-specific metric - namely, its accuracy in parsing 6 sentence types necessary to track the rise of auxiliary do (as in They did not come vs. its historical precursor They came not). Retrieving the relevant sentences from the gold and test versions with CorpusSearch queries, we find that the parser's accuracy promises to be sufficient for our purposes. A remaining concern is the variability of the output, which we plan to address with three pieces of future work sketched in the conclusion.
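The 8-fold evaluation protocol mentioned above could be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the document granularity, shuffling, and fold construction are assumptions, and the real evaluation also runs evalb and CorpusSearch queries over each fold's output.

```python
import random

def make_kfold_splits(documents, k=8, seed=0):
    """Partition a list of documents into k folds for train/test rotation.

    Each fold serves once as the test set while the remaining k-1 folds
    form the training set, so every document is scored exactly once.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    folds = [docs[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        splits.append((train, test))
    return splits
```

Rotating folds in this way yields parser scores over the whole corpus rather than a single held-out slice, which is what makes the aggregate accuracy figures robust.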
Recommended from our members
Parsing Early English Books Online for Linguistic Search
This work addresses the question of how to evaluate a state-of-the-art parser on Early English Books Online (EEBO), a 1.5-billion-word collection of unannotated text, for utility in linguistic research. Earlier work has trained and evaluated a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) and defined a query-based evaluation to score the retrieval of 6 specific sentence types of interest. However, significant differences between EEBO and the manually annotated PPCEME make it inappropriate to assume that these results will generalize to EEBO. Fortunately, an overlap of source material in PPCEME and EEBO allows us to establish a token alignment between them and to score the POS-tagging on EEBO. We use this alignment together with a more principled version of the query-based evaluation to score the recovery of sentence types on this subset of EEBO, thus allowing us to estimate the increase in error rate on EEBO compared to PPCEME. The increase is largely due to differences in sentence segmentation between the two corpora, pointing the way to further improvements.
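A token alignment of the kind described above can be sketched with a longest-matching-subsequence approach; the function names are illustrative, and the paper's actual alignment is more involved since the overlapping material differs in tokenization, spelling, and OCR quality.

```python
from difflib import SequenceMatcher

def align_tokens(eebo_tokens, ppceme_tokens):
    """Align two token sequences drawn from overlapping source material.

    Returns index pairs (i, j) for tokens matched one-to-one; stretches
    where the tokenizations diverge are simply skipped in this sketch.
    """
    matcher = SequenceMatcher(a=eebo_tokens, b=ppceme_tokens, autojunk=False)
    pairs = []
    for i, j, size in matcher.get_matching_blocks():
        pairs.extend((i + k, j + k) for k in range(size))
    return pairs

def pos_agreement(pairs, eebo_tags, gold_tags):
    """Score predicted POS tags on EEBO against gold PPCEME tags,
    restricted to the aligned token pairs."""
    if not pairs:
        return 0.0
    hits = sum(eebo_tags[i] == gold_tags[j] for i, j in pairs)
    return hits / len(pairs)
```

Restricting the score to aligned tokens is what makes the comparison fair: only positions where both corpora plainly contain the same word contribute to the estimated error rate.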
A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus
We describe the construction and evaluation of a part-of-speech tagger for
Yiddish (the first one, to the best of our knowledge). This is the first step
in a larger project of automatically assigning part-of-speech tags and
syntactic structure to Yiddish text for purposes of linguistic research. We
combine two resources for the current work - an 80K word subset of the Penn
Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million
words of OCR'd Yiddish text from the Yiddish Book Center (YBC). We compute word
embeddings on the YBC corpus, and these embeddings are used with a tagger model
trained and evaluated on the PPCHY. Yiddish orthography in the YBC corpus has
many spelling inconsistencies, and we present some evidence that even simple
non-contextualized embeddings are able to capture the relationships among
spelling variants without the need to first "standardize" the corpus. We
evaluate the tagger performance on a 10-fold cross-validation split, with and
without the embeddings, showing that the embeddings improve tagger performance.
However, a great deal of work remains to be done, and we conclude by discussing
some next steps, including the need for additional annotated training and test
data.
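The claim that even non-contextualized embeddings capture relationships among spelling variants can be illustrated with a nearest-neighbor lookup over the embedding space. The helper names and the toy vectors below are illustrative stand-ins for embeddings trained on the YBC corpus, not the paper's actual setup.

```python
def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_variants(word, embeddings, top_n=3):
    """Rank vocabulary items by embedding similarity to `word`.

    If the embeddings capture distributional behavior, spelling variants
    of the same lexeme should surface near the top without any prior
    orthographic standardization.
    """
    sims = [(other, cosine(embeddings[word], vec))
            for other, vec in embeddings.items() if other != word]
    sims.sort(key=lambda pair: -pair[1])
    return [w for w, _ in sims[:top_n]]
```

A tagger can then consume these vectors directly as input features, which is how the with-embeddings condition in the 10-fold evaluation would use them.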
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we
organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The new challenge revisits the previous CHiME-5 challenge and further considers
the problem of distant multi-microphone conversational speech diarization and
recognition in everyday home environments. Speech material is the same as the
previous CHiME-5 recordings except for accurate array synchronization. The
material was elicited using a dinner party scenario with efforts taken to
capture data that is representative of natural conversational speech. This
paper provides a baseline description of the CHiME-6 challenge for both
segmented multispeaker speech recognition (Track 1) and unsegmented
multispeaker speech recognition (Track 2). Of note, Track 2 is the first
challenge activity in the community to tackle an unsegmented multispeaker
speech recognition scenario with a complete set of reproducible open source
baselines providing speech enhancement, speaker diarization, and speech
recognition modules.
Scale-Space Expansion of Acoustic Features Improves Speech Event Detection
In a system for detecting and measuring phonetic events (here bursts, voice onsets, and voice-onset times), we show that the addition of features smoothed at multiple scales can improve both recall (the proportion of events correctly identified) and measurement accuracy (the timing of events and the difference between event times, relative to expert human judgments). Multi-scale (or "scale space") features had an especially strong positive effect on robustness across datasets with different materials and recording conditions. Standard machine-learning classifiers were able to integrate information across scales, without any special treatment of the multiscale features.
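The multi-scale expansion described above can be sketched as smoothing a per-frame feature track at several window sizes and stacking the results into one vector per frame. This is a simplified illustration: the moving average stands in for the Gaussian smoothing typical of scale-space representations, and the window sizes are assumptions.

```python
def smooth(signal, window):
    """Moving-average smoothing with edge padding (a simple stand-in
    for Gaussian smoothing at one scale)."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(signal))]

def scale_space_features(signal, windows=(1, 3, 5)):
    """Expand a 1-D acoustic feature track into per-frame vectors that
    stack the track smoothed at each scale.

    A standard classifier can then weigh coarse and fine evidence
    jointly, with no special treatment of the multiscale features.
    """
    tracks = [list(signal) if w == 1 else smooth(signal, w) for w in windows]
    return [tuple(track[i] for track in tracks) for i in range(len(signal))]
```

Concatenating scales per frame, rather than picking one smoothing level, is what lets the classifier itself decide how much temporal context each event type needs.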